A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Modern large language models demonstrate impressive capabilities in text
generation and generalization. However, they often struggle with solving text
editing tasks, particularly when it comes to correcting spelling errors and
mistypings. In this paper, we present a methodology for generative spelling
correction (SC), which we tested on English and Russian and which can
potentially be extended to any language with minor changes. Our research
mainly focuses on exploring natural spelling errors and mistypings in texts and
studying how those errors can be emulated in correct sentences to
effectively enrich the pre-training procedure of generative models. We investigate the
impact of such emulations and the models' abilities across different text
domains. In this work, we study two spelling corruption techniques: 1) the
first mimics human behavior when making a mistake by leveraging error
statistics from a particular dataset, and 2) the second injects the most common
spelling errors, keyboard mistypes, and several heuristic corruptions into the texts. We
conducted experiments with various corruption strategies, model
architectures, and model sizes at the pre-training and fine-tuning stages, and
evaluated the models on single-domain and multi-domain test sets. As a
practical outcome of our work, we introduce SAGE (Spell checking via
Augmentation and Generative distribution Emulation), a library for
automatic generative SC that includes a family of pre-trained generative models
and built-in augmentation algorithms.
Comment: to appear in EACL 202
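A minimal sketch of the second (heuristic) corruption strategy described above, assuming a simple character-level model of typos; the keyboard-neighbor map, operation set, and corruption probability are illustrative assumptions and not the algorithms shipped in SAGE.

```python
import random

# Illustrative (partial) neighbor map for a QWERTY keyboard; assumed for this sketch.
KEY_NEIGHBORS = {
    "a": "qwsz", "e": "wrsd", "i": "ujko", "o": "iklp",
    "n": "bhjm", "s": "awedxz", "t": "rfgy", "r": "edft",
}

def corrupt(sentence: str, p: float = 0.05, seed: int = 0) -> str:
    """Emulate natural typos: keyboard miss-clicks, dropped and doubled characters."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if rng.random() >= p:
            out.append(ch)          # keep the character unchanged
            continue
        op = rng.choice(["neighbor", "drop", "double"])
        if op == "neighbor" and ch.lower() in KEY_NEIGHBORS:
            out.append(rng.choice(KEY_NEIGHBORS[ch.lower()]))  # miss-click
        elif op == "double":
            out.append(ch * 2)      # accidental double press
        else:
            pass                    # "drop": omit the character
    return "".join(out)

print(corrupt("the quick brown fox jumps over the lazy dog", p=0.1))
```

Corrupted sentences produced this way can be paired with their clean originals to enrich the pre-training data of a generative corrector.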
A Family of Pretrained Transformer Language Models for Russian
Nowadays, Transformer language models (LMs) are a fundamental component
of NLP research methodologies and applications. However, the development of
such models specifically for the Russian language has received little
attention. This paper presents a collection of 13 Russian Transformer LMs based
on the encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and
encoder-decoder (ruT5, FRED-T5) models in multiple sizes. Access to these
models is readily available via the HuggingFace platform. We report on
the model architecture design and pretraining, and present the results of evaluating
the models' generalization abilities on Russian natural language understanding and
generation datasets and benchmarks. By pretraining and releasing these
specialized Transformer LMs, we hope to broaden the scope of NLP research
directions and enable the development of industrial solutions for the Russian
language.
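A minimal usage sketch, assuming the standard HuggingFace transformers API; the repository id "ai-forever/ruBert-base" is an assumption about how the ruBERT checkpoint is published on the Hub, not a detail taken from the abstract.

```python
# Hedged example: load one of the released Russian encoders and fill a masked token.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "ai-forever/ruBert-base"  # assumed Hub id for the ruBERT encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Short Russian masked-LM probe.
text = f"Москва - столица {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the mask position and decode the most likely token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

The decoder (ruGPT-3) and encoder-decoder (ruT5, FRED-T5) checkpoints would be loaded analogously with the causal-LM and seq2seq AutoModel classes.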